Status of the QCDSP project 
We describe the completed 8,192-node, 0.4Tflops machine at Columbia as well as the 12,288-node, 0.6Tflops
machine assembled at the RIKEN Brookhaven Research Center. We present current performance figures and our
experience in commissioning these large machines. We outline our ongoing physics program and explain how
the configuration of the machine is varied to support a wide range of lattice QCD problems requiring a variety
of machine sizes. Finally, we briefly discuss future prospects for large-scale lattice QCD machines.



1. INTRODUCTION 

The large computational requirements of lattice QCD,
coupled with the enormous cost/performance advantages
that can be obtained with specially configured computer
hardware, have encouraged the design and construction of
a variety of purpose-built machines over nearly the past
18 years. The QCDSP machines now being completed by our
collaboration [1] represent a continued development in
this direction.

The most recent, 12,288-node machine at the RIKEN
Brookhaven Research Center has a construction cost of
approximately $1.8M and a peak speed of 0.6Tflops, or a
cost per peak performance of $3/Mflops. At present, our
most efficient routine inverts the Wilson Dirac operator.
A complete program which carries out a Wilson-fermion,
hybrid Monte Carlo evolution achieves about 25% efficiency
on a lattice volume of 4^4 sites per node, which
corresponds to a dollar per delivered Mflops figure of
$13.6/Mflops.
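
The cost figures above follow from simple arithmetic, sketched below. The function and variable names are illustrative choices, not taken from the text. Note that an efficiency of exactly 25% would give $12 per sustained Mflops, so the quoted $13.6 corresponds to an efficiency of roughly 22%, consistent with "about 25%".

```python
# Cost/performance arithmetic for the 12,288-node machine.
# Input numbers come from the text; the names are illustrative.
CONSTRUCTION_COST_USD = 1.8e6
PEAK_MFLOPS = 0.6e6  # 0.6 Tflops expressed in Mflops

COST_PER_PEAK_MFLOPS = CONSTRUCTION_COST_USD / PEAK_MFLOPS


def cost_per_sustained_mflops(efficiency):
    """Dollars per delivered Mflops at a given fraction of peak speed."""
    return COST_PER_PEAK_MFLOPS / efficiency


def implied_efficiency(dollars_per_sustained_mflops):
    """Fraction of peak implied by a quoted dollars-per-delivered-Mflops figure."""
    return COST_PER_PEAK_MFLOPS / dollars_per_sustained_mflops


if __name__ == "__main__":
    print(COST_PER_PEAK_MFLOPS)             # dollars per peak Mflops
    print(cost_per_sustained_mflops(0.25))  # at exactly 25% efficiency
    print(implied_efficiency(13.6))         # efficiency implied by $13.6/Mflops
```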

Achieving this level of performance required considerable
care when writing the routine which applies the Wilson
Dirac operator to a spinor field. (We expect that with
more effort this efficiency may increase somewhat further,
and that our staggered code will achieve a similar level
of performance.) This high-performance code is written in
assembly language and uses many of the special hardware
features provided to boost efficiency. However, the bulk
of the conjugate gradient code which applies this Dirac
operator, as well as the hybrid Monte Carlo evolution
which uses the conjugate gradient inverter, is written in
C++ and can be understood and modified relatively easily.
Thus, this 25% efficiency is achieved in a "user-friendly"
C++ physics software environment in which the carefully
coded, high-performance routines are easily called by the
high-level code, and this high-level code is no more
difficult to write than that in a more standard
environment.
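
The division of labor described above, a generic high-level solver that calls an optimized operator kernel, can be sketched as follows. This is a minimal illustration in Python rather than the collaboration's C++ and assembly code. The names `conjugate_gradient` and `toy_operator` are hypothetical, and the toy operator (a 1-D periodic lattice Laplacian plus a mass term) merely stands in for the positive-definite operator that a real inverter would build from the Wilson Dirac operator.

```python
def conjugate_gradient(apply_op, b, tol=1e-10, max_iter=1000):
    """Solve A x = b, where A is symmetric positive definite and is
    supplied only as a callable that applies it to a vector."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # search direction
    rr = sum(v * v for v in r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        ap = apply_op(p)   # the only place the (optimized) kernel is called
        alpha = rr / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = sum(v * v for v in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def toy_operator(v, mass_sq=0.5):
    """Hypothetical stand-in kernel: 1-D periodic Laplacian plus mass term."""
    n = len(v)
    return [(2.0 + mass_sq) * v[i] - v[(i + 1) % n] - v[(i - 1) % n]
            for i in range(n)]


if __name__ == "__main__":
    source = [1.0] + [0.0] * 7          # point source on an 8-site lattice
    solution = conjugate_gradient(toy_operator, source)
    residual = [a - s for a, s in zip(toy_operator(solution), source)]
    print(max(abs(v) for v in residual))
```

Because the solver sees the operator only through `apply_op`, the kernel can be replaced by a hand-tuned routine without touching the high-level code.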
Figure 1. The 8,192-node, 0.4Tflops peak speed
QCDSP machine, running at Columbia since April 1998.



Table 1
Locations of the five installed QCDSP machines.

Location      Number of Nodes    Peak Speed
RIKEN/BNL     12,288             0.6Tflops
Columbia      8,192              0.4Tflops
SCRI/FSU      1,024              50Gflops
OSU           128                6.4Gflops
Wuppertal     64                 3.2Gflops


2. PROJECT STATUS 

We have now completed QCDSP machines at four of the five
sites listed in Table 1. The machine at the RIKEN
Brookhaven Research Center is now completely assembled,
successfully runs our basic diagnostic code and is in a
final debugging mode. The machines at the other four
locations have been running production code for a number
of months. In particular, the 8,192-node machine at
Columbia (Fig. 1) has been running quite reliably in
production mode since April 1998. There is a hardware
failure roughly every two weeks, which can usually be
repaired by simply replacing a faulty daughter board,
although on occasion more subtle errors need to be
diagnosed. We anticipate that even this error rate will
decrease as a period of infant mortality passes.
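
To put the quoted failure rate in perspective, it can be converted into a rough per-node figure. The sketch below assumes that failures are independent and spread evenly over all nodes, which the text does not state; the function name is illustrative.

```python
NODES = 8192                   # Columbia machine, from the text
DAYS_BETWEEN_FAILURES = 14.0   # "roughly every two weeks"


def per_node_mtbf_years(nodes, machine_days_between_failures):
    """Mean time between failures of a single node, in years, assuming
    independent failures distributed uniformly over the nodes."""
    return nodes * machine_days_between_failures / 365.25


if __name__ == "__main__":
    print(per_node_mtbf_years(NODES, DAYS_BETWEEN_FAILURES))
```

Under these assumptions, one failure every two weeks across the whole machine corresponds to a mean time between failures of a few hundred years for an individual node.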



3. CONSTRUCTION OF THE LARGE 
MACHINES 

During the past year, we have gone from operating up to
eight motherboards in an air-cooled crate to much larger
machines composed of many cabinets for which water cooling
is supplied. In order to allow the dense stacking of the
8-motherboard crates within these larger machines, we have
arranged a water-cooled heat absorber below the fan tray
that blows cooling air through each individual crate.

The introduction of water cooling has caused 
some problems. The room humidity must be con- 
trolled to avoid condensation on the cold pip- 
ing within the machine. With this internally- 
provided cooling, these large computers become 
essentially decoupled from the cooling and safety 
systems routinely provided in the larger room 
housing the machine. As a result, we have in- 
stalled temperature- and smoke-sensitive safety 
systems in the large machines at both Columbia 
and Brookhaven. 

4. PHYSICS PROGRAM 

The earlier computers at Columbia were each configured as
a single large machine and used to tackle large, demanding
problems that typically had been previously explored on
smaller machines. However, we are now able to exploit the
easily configurable character of the QCDSP architecture to
adjust the hardware arrangement to support a variety of
physics goals. These stretch from initial studies of
dynamical domain-wall fermion thermodynamics on small,
8^3 x 4 lattices on 128-node, 6.4Gflops machines (we are
presently running approximately 9 such machines) to a
continuation of a large, N_f = 0, 2 and 4 staggered
fermion calculation on two or three 2048-node, 100Gflops
machines. As our physics objectives mature, we expect to
assemble even larger units, ultimately running the most
demanding problems on a single machine composed of perhaps
50-75% of all available hardware. Our 8,192-node machine
is presently running as 17 separate computers on a wide
range of lattices.
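
The match between lattice size and partition size can be made concrete by counting lattice sites per node. The sketch below is a simple illustration with a hypothetical helper name. It treats the lattice as four-dimensional, so any extra fifth dimension used by domain-wall fermions is ignored.

```python
def sites_per_node(spatial_extent, temporal_extent, nodes):
    """Four-dimensional lattice sites assigned to each node, assuming the
    lattice divides evenly among the nodes."""
    total_sites = spatial_extent ** 3 * temporal_extent
    if total_sites % nodes != 0:
        raise ValueError("lattice does not divide evenly among the nodes")
    return total_sites // nodes


if __name__ == "__main__":
    # 8^3 x 4 lattice on a 128-node machine, as in the text
    print(sites_per_node(8, 4, 128))
    # Local volume of 4^4 sites per node used for the efficiency figure
    print(4 ** 4)
```

The small thermodynamics lattice therefore places 16 sites (a 2^4 block) on each node, compared with the 256 sites per node of the 4^4 local volume quoted for the 25% efficiency figure.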



5. FUTURE ARCHITECTURES 

Now we turn to a brief discussion of future ma- 
chines. In order to keep the discussion reasonably 
concrete, let us consider the design of a machine 
intended to sustain 2.0Tflops on a 32^3 x 64 lattice.
We assume that there will be demanding,
full QCD calculations that require a single large 
machine be used to thermalize and evolve a sin- 
gle, relatively small lattice. Our performance es- 
timates are based on those in |||. Two sets of 
curves are shown in Fig. 2. The right-hand family
shows the number of processors required to sus- 
tain 2Tflops as a function of network bandwidth 
on a 32^3 x 64 lattice for preconditioned Wilson
fermions and three different processor speeds. For 
comparison, the curves on the left correspond to a 
150Gflops machine with slower individual proces- 
sors. (For simplicity the effects of communication 
latency have been ignored in this figure.) 

For fixed-speed processors and a fixed-size problem, the
number of required processors falls as increasing
bandwidth increases the overall efficiency of the
machine. Finally, when the communication time falls below
that required for the floating point portion of the
calculation, the curve flattens and no longer depends on
bandwidth. The solid square corresponds to 4K processors,
each sustaining 0.5Gflops, with a 1Gbit network bandwidth
available to each processor; such a machine might be
constructed from standard workstation and network
subsystems in a couple of years. However, including a
modest 20 μsec network latency will lower the sustained
speed of this machine by at least a factor of 2. The solid
triangle represents 16K processors, each sustaining
0.15Gflops, with 0.64Gbit network bandwidth, a natural
evolution of our QCDSP architecture with a 2x enhanced
network. Finally, the solid circle shows the 0.15Tflops
sustained RIKEN/BNL machine.
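
The qualitative shape of these curves (a processor count that falls with bandwidth and then flattens) can be reproduced with a toy model. The sketch below is not the performance estimate used for Fig. 2. It assumes that communication overlaps fully with computation, that each processor holds a hypercubic local volume, that processor counts are powers of two, and that latency is zero. The flop count of 1320 per site for one application of the Wilson hopping term and the 48 bytes sent per face site are illustrative constants, and all function names are hypothetical.

```python
SITES = 32 ** 3 * 64      # global lattice volume, from the text
TARGET_FLOPS = 2.0e12     # 2 Tflops sustained, from the text

# Illustrative constants, not taken from the text:
FLOPS_PER_SITE = 1320.0       # conventional count for the Wilson hopping term
BYTES_PER_FACE_SITE = 48.0    # 12 single-precision reals per boundary site


def sustained_per_processor(n_proc, proc_flops, bandwidth_bits):
    """Sustained flops of one processor when communication is fully
    overlapped with computation (toy model, no latency)."""
    local_sites = SITES / n_proc
    side = local_sites ** 0.25       # pretend the local volume is a hypercube
    face_sites = 8.0 * side ** 3     # two faces in each of four directions
    t_fp = FLOPS_PER_SITE * local_sites / proc_flops
    t_comm = 8.0 * BYTES_PER_FACE_SITE * face_sites / bandwidth_bits
    return FLOPS_PER_SITE * local_sites / max(t_fp, t_comm)


def processors_needed(proc_flops, bandwidth_bits, max_proc=2 ** 24):
    """Smallest power-of-two processor count that reaches TARGET_FLOPS,
    or None if max_proc is not enough."""
    n = 1
    while n <= max_proc:
        total = n * sustained_per_processor(n, proc_flops, bandwidth_bits)
        if total >= TARGET_FLOPS:
            return n
        n *= 2
    return None


if __name__ == "__main__":
    for bandwidth in (1e7, 1e8, 1e9, 1e10):
        print(bandwidth, processors_needed(0.5e9, bandwidth))
```

In this model the required processor count drops as the bandwidth grows and stops changing once the communication time falls below the floating point time, which is the behavior described above. The specific numbers it prints depend on the assumed constants and should not be read as values from Fig. 2.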

This suggests that available technology should 
permit either style of QCD-machine to be con- 
structed. In deciding between these two ap- 
proaches one must compare the benefits of a stan- 
dard processor/workstation architecture to the 
significant cost advantages achieved with a cus- 
tom design. It is likely that a combination of the 
best of both approaches will be possible. 
Figure 2. The number of processors needed to
sustain 2Tflops as a function of total network
bandwidth delivered to each node is shown by
the curves on the right for three different
processor speeds. The left-hand curves correspond to
a machine which sustains 0.15Tflops.

6. CONCLUSION 

This meeting marks the completion of the cur- 
rent set of QCDSP machines under construction. 
We now look forward to a number of years dur- 
ing which these significant resources will be used 
to increase our understanding of both QCD and 
quantum field theory.